Search results for "Edit distance"
showing 8 items of 8 documents
Efficient algorithm for learning simple regular expressions from noisy examples
1994
We present an efficient algorithm for finding approximate repetitions in a given sequence of characters. First, we define a class of simple regular expressions which are of star-height one and do not contain union operations, and a stochastic mutation process of a given length over a string of characters. Then, assuming that a given string of characters was obtained, through corruption by the defined mutation process, from some sufficiently long word generated by a simple regular expression, we try to restore the expression. We prove that, to within some reasonable accuracy, this is always possible if the length of the mutation process is bounded relative to the length of the example. We provide an algorithm by…
A novel XML document structure comparison framework based-on sub-tree commonalities and label semantics
2012
XML similarity evaluation has become a central issue in the database and information communities, its applications ranging over document clustering, version control, data integration and ranked retrieval. Various algorithms for comparing hierarchically structured data, XML documents in particular, have been proposed in the literature. Most of them make use of techniques for finding the edit distance between tree structures, XML documents being commonly modeled as Ordered Labeled Trees. Yet, a thorough investigation of current approaches led us to identify several similarity aspects, i.e., sub-tree related structural and semantic similarities, which are not sufficient…
Vector representation of non-standard spellings using dynamic time warping and a denoising autoencoder
2017
The presence of non-standard spellings in Twitter causes challenges for many natural language processing tasks. Traditional approaches mainly regard the problem as a translation, spell checking, or speech recognition problem. This paper proposes a method that represents the stochastic relationship between words and their non-standard versions in real vectors. The method uses dynamic time warping to preprocess the non-standard spellings and an autoencoder to derive the vector representation. The derived vectors encode word patterns and the Euclidean distance between the vectors represents a distance in the word space that challenges the prevailing edit distance. After training the autoencoder o…
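As an illustration of the dynamic time warping step mentioned in this abstract, the following is a minimal sketch of a DTW distance between two spellings. It is not the paper's implementation: the function name `dtw_distance` and the 0/1 character mismatch cost are assumptions chosen for illustration only.

```python
def dtw_distance(s, t, cost=lambda a, b: 0 if a == b else 1):
    """Dynamic time warping distance between two sequences.

    Assumption: the local cost is 0 for equal characters and 1 otherwise.
    The abstract does not state which local cost the authors use.
    """
    n, m = len(s), len(t)
    inf = float("inf")
    # d[i][j] = cost of the best warping path aligning s[:i] with t[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = cost(s[i - 1], t[j - 1]) + min(
                d[i - 1][j],      # s advances, t[j-1] is reused
                d[i][j - 1],      # t advances, s[i-1] is reused
                d[i - 1][j - 1],  # both advance
            )
    return d[n][m]


if __name__ == "__main__":
    # Repeated letters can be absorbed by warping at no cost.
    print(dtw_distance("soooo", "so"))
    print(dtw_distance("cat", "cut"))
```

Unlike edit distance, DTW lets one character align with a run of characters, so elongated spellings such as "soooo" cost nothing against "so" under this 0/1 cost.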
BGSA: a bit-parallel global sequence alignment toolkit for multi-core and many-core architectures
2018
Motivation: Modern bioinformatics tools for analyzing large-scale NGS datasets often need to include fast implementations of core sequence alignment algorithms in order to achieve reasonable execution times. We address this need by presenting the BGSA toolkit for optimized implementations of popular bit-parallel global pairwise alignment algorithms on modern microprocessors. Results: BGSA outperforms Edlib, SeqAn and BitPAl for pairwise edit distance computations and Parasail, SeqAn and BitPAl when using more general scoring schemes for pairwise alignments of a batch of sequence reads on both standard multi-core CPUs and Xeon Phi many-core CPUs. Furthermore, banded edit distance perf…
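To give a sense of what a bit-parallel global edit distance computation looks like, here is a minimal sketch in the style of the Myers bit-vector algorithm, as adapted by Hyyrö for global distance. It is not BGSA's code and the abstract does not say which algorithms the toolkit implements; the function name `bitparallel_edit_distance` is hypothetical, and Python's unbounded integers stand in for the machine words or SIMD registers an optimized toolkit would use.

```python
def bitparallel_edit_distance(a, b):
    """Levenshtein distance between a and b using bit-vector column updates.

    Sketch only: each DP column is encoded as vertical +1/-1 delta bit-vectors
    (vp, vn), so one column is processed with a constant number of bitwise
    operations on m-bit integers, where m = len(a).
    """
    m = len(a)
    if m == 0:
        return len(b)
    # peq[ch] has bit i set when a[i] == ch
    peq = {}
    for i, ch in enumerate(a):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    mask = (1 << m) - 1   # keep only the m low bits (Python ints are unbounded)
    last = 1 << (m - 1)   # bit for the last row of the DP column
    vp, vn = mask, 0      # first column: every vertical delta is +1
    dist = m              # distance between a and the empty prefix of b
    for ch in b:
        x = peq.get(ch, 0)
        d0 = ((((x & vp) + vp) ^ vp) | x | vn) & mask  # diagonal delta is 0
        hp = (vn | ~(d0 | vp)) & mask                  # horizontal delta +1
        hn = d0 & vp                                   # horizontal delta -1
        if hp & last:
            dist += 1
        elif hn & last:
            dist -= 1
        # Shift in a 1: the top boundary row grows by 1 per column (global).
        hp = ((hp << 1) | 1) & mask
        hn = (hn << 1) & mask
        vp = (hn | ~(d0 | hp)) & mask
        vn = hp & d0
    return dist


if __name__ == "__main__":
    print(bitparallel_edit_distance("kitten", "sitting"))
```

The result agrees with the classic quadratic dynamic program; the gain comes from updating a whole column of the table per input character with word-level operations.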
Top-k String Similarity Joins
2020
Top-k joins have been extensively studied in relational databases as ranking operations when every object has, among others, at least one ranking attribute. However, the focus has mostly been the case when the join attributes are of primitive data types (e.g., numerical values) and the join predicate is equality. In this work, we consider string objects assigned such ranking attributes or simply scores. Given two collections of string objects and a string similarity measure (e.g., the edit distance), we introduce the top-k string similarity join which returns k sufficiently similar pairs of objects with respect to a similarity threshold ϵ, which have the highest combined score computed by…
High Locality Representations for Automated Programming
2011
We study the locality of the genotype-phenotype mapping used in grammatical evolution (GE). GE is a variant of genetic programming that can evolve complete programs in an arbitrary language using a variable-length binary string. In contrast to standard GP, which applies search operators directly to phenotypes, GE uses an additional mapping and applies search operators to binary genotypes. Therefore, there is a large semantic gap between genotypes (binary strings) and phenotypes (programs or expressions). The case study shows that the mapping used in GE has low locality leading to low performance of standard mutation operators. The study at hand is an example of how basic design principles o…
Toward Approximate GML Retrieval Based on Structural and Semantic Characteristics
2010
GML is emerging as the new standard for representing geographic information in GISs on the Web, allowing the encoding of structurally and semantically rich geographic data in self-describing XML-based geographic entities. In this study, we address the problem of approximate querying and ranked results for GML data and provide a method for GML query evaluation. Our method consists of two main contributions. First, we propose a tree model for representing GML queries and data collections. Then, we introduce a GML retrieval method based on the concept of tree edit distance as an efficient means for comparing semi-structured data. Our approach allows the evaluation of bo…
Skeleton-Based Multiview Reconstruction
2016
The advantage of skeleton-based 3D reconstruction is to completely generate a single 3D object from well-chosen views. Having numerous views is necessary for a reliable reconstruction but projections of skeletons lead to different topologies. We reconstruct 3D objects with a curved medial axis (whose topology is a tree) from the perspective skeletons on an arbitrary number of calibrated acquisitions. The main contribution is to estimate the 3D skeleton from multiple images: its topology is chosen as the closest to those of the perspective skeletons on the set of images, which means that the number of topology changes to map the 3D skeleton topology to topologies on im…